Abstract

The class of chain event graph models is a generalisation of the class of discrete Bayesian networks,
retaining most of the structural advantages of the Bayesian network for model interrogation, propagation
and learning, while more naturally encoding asymmetric state spaces and the order in which
events happen. In this paper we demonstrate how, with complete sampling, conjugate closed form
model selection based on product Dirichlet priors is possible, and prove that suitable homogeneity
assumptions characterise the product Dirichlet prior on this class of models. We demonstrate our
techniques using two educational examples.
1. Introduction

Bayesian networks (BNs) are currently one of the most widely used graphical models for representing
and analysing finite discrete graphical multivariate distributions with their explicit coding
of conditional independence relationships between a system's variables. However, despite their
power and usefulness, it has long been known that BNs cannot fully or efficiently represent certain
common scenarios. These include situations where the state space of a variable is known to depend
on other variables, or where the conditional independence between variables is itself dependent
on the values of other variables. Some examples of such latter scenarios are given by Poole and
Zhang. In order to overcome such deficiencies, enhancements have been proposed to the basic
Bayesian network in order to create so-called "context-specific" Bayesian networks. These have
their own problems, however: either they represent too much of the information about a model in
a non-graphical way, thus undermining the rationale for using a graphical model in the first place,
or they struggle to represent a general class of models efficiently. Other graphical approaches that
seek to account for "context-specific" beliefs suffer from similar problems.

This has led to the proposal of a new graphical model, the chain event graph (CEG),
which was first propounded in [4]. As well as solving the aforementioned problems associated with
Bayesian networks and related graphical models, CEGs are able, not unrelatedly, to encode far 
more efficiently the common structure in which models are elicited — as asymmetric processes — 
in a single graph. To this end, CEGs are based not on Bayesian networks, but on event trees (ETs) 
[^. Event trees are trees where nodes represent situations — i.e. scenarios in which a unit might 
find itself — and each node's extending edges represent possible future situations that can develop 
from the current one. It follows that every atom of the event space is encoded by exactly one
root-to-leaf path, and each root-to-leaf path corresponds to exactly one atomic event. It has been argued
that ETs are expressive frameworks to directly and accurately represent beliefs about a process, 
particularly when the model is described most naturally, as in the example below, through how 
situations might unfold [^. However, as explained in 0], ETs can contain excessive redundancy in 
their structure, with subtrees describing probabilistically isomorphic unfoldings of situations being 
represented separately. They are also unable to explicitly express a model's non-trivial conditional 
independences. The CEG deals with these shortcomings by combining the subtrees that describe 
identical subprocesses (see Q for further details), so that the CEG derived from a particular ET 
has a simpler topology while in turn expressing more conditional independence statements than is 
possible through an ET. 

We illustrate the construction and the types of symmetries it is possible to code using a CEG 
with the following running example. 

Example 1. Successful students on a one year programme study components A and B, but not 
everyone will study the components in the same order: each student will be allocated to study either 
module A or B for the first 6 months and then the other component for the final 6 months. After the 
first 6 months each student will be examined on their allocated module and be awarded a distinction 
(denoted with D), a pass (P) or a fail (F), with an automatic opportunity to resit the module in 
the last case. If they resit then they can pass and be allowed to proceed to the other component of 
their course, or fail again and be permanently withdrawn from the programme. Students who have 
succeeded in proceeding to the second module can again either fail, pass or be awarded a distinction. 
On this second round, however, there is no possibility of resitting if the component is failed. With 
an obvious extension of the labelling, this system can be depicted by the event tree given in Figure 1.

To specify a full probability distribution for this model it is sufficient to only specify the distri- 
butions associated with the unfolding of each situation a student might reach. However, in many 
applications it is often natural to hypothesise a model where the distribution associated with the 
unfolding from one situation is assumed identical to another. Situations that are thus hypothesised 
to have the same transition probabilities to their children are said to be in the same stage. Thus in 
Example 1 suppose that as well as subscribing to the ET of Figure 1 we want to consider a model
also embodying the following three hypotheses: 

1. The chances of doing well in the second component are the same whether the student passed 
first time or after a resit. 

2. The components A and B are equally hard. 

3. The distribution of marks for the second component is unaffected by whether students passed 
or got a distinction for the first component. 

These hypotheses can be identified with a partitioning of the non-leaf nodes (situations). In
Figure 1 the set of situations is

$$S = \{v_0, A, B, P_{1,A}, P_{1,B}, D_{1,A}, D_{1,B}, F_{1,A}, F_{1,B}, P_{R,A}, P_{R,B}\}.$$

The partition $C$ of $S$ that encodes exactly the above three hypotheses consists of the stages
$u_1 = \{A, B\}$, $u_2 = \{F_{1,A}, F_{1,B}\}$, and $u_3 = \{P_{1,A}, P_{1,B}, P_{R,A}, P_{R,B}, D_{1,A}, D_{1,B}\}$, together with
the singleton $u_0 = \{v_0\}$. Thus the second stage $u_2$, for example, implies that the probabilities
on the edges $(F_{1,B}, F_{R,B})$ and $(F_{1,A}, F_{R,A})$ are equal, as are the probabilities on $(F_{1,B}, P_{R,B})$ and
$(F_{1,A}, P_{R,A})$. Clearly the joint probability distribution of the model, whose atoms are the root-to-leaf
paths of the tree, is determined by the conditional probabilities associated with the stages.
A CEG is the graph that is constructed to encode a model that can be specified through an event 
tree combined with a partitioning of its situations into stages. 
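
As a minimal sketch, the staging of Example 1 can be written down and checked in a few lines of Python. The node names, the out-degrees in `children` (two edges for the allocation and for a resit, three for an exam) and both helper functions are illustrative assumptions based on one reading of the example, not notation from the paper. The sketch confirms that the four stages partition the set of situations and counts the free probabilities needed with and without the staging.

```python
# Situations of the Example 1 event tree and the number of edges leaving each.
# Assumed reading: allocation has 2 outcomes, an exam 3 (D, P, F), a resit 2.
children = {
    "v0": 2, "A": 3, "B": 3,
    "F1A": 2, "F1B": 2,
    "P1A": 3, "P1B": 3, "D1A": 3, "D1B": 3, "PRA": 3, "PRB": 3,
}

# The stages u0, ..., u3 encoding the three hypotheses.
stages = {
    "u0": {"v0"},
    "u1": {"A", "B"},
    "u2": {"F1A", "F1B"},
    "u3": {"P1A", "P1B", "PRA", "PRB", "D1A", "D1B"},
}


def is_partition(blocks, universe):
    """True if the blocks are disjoint and together cover the universe."""
    seen = [v for block in blocks for v in block]
    return len(seen) == len(set(seen)) and set(seen) == set(universe)


def free_parameters(blocks):
    """One probability vector per stage; k edges give k - 1 free entries."""
    total = 0
    for block in blocks:
        degrees = {children[v] for v in block}
        if len(degrees) != 1:
            raise ValueError("situations in a stage need equal out-degree")
        total += degrees.pop() - 1
    return total


finest = [{v} for v in children]           # every situation in its own stage
staged = list(stages.values())

partition_ok = is_partition(staged, children)
params_finest = free_parameters(finest)
params_staged = free_parameters(staged)
```

Under these assumptions the staged model needs far fewer free probabilities than the unrestricted event tree, which is the sense in which the three hypotheses simplify the model.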

In this paper we suppose that we are in a context similar to that of Example [1] where, for any 
possible model, the sample space of the problem must be consistent with a single event tree, but 
where on the basis of a sample of students' records we want to select one of a number of different 
possible CEG models, i.e. we want to find the "best" partitioning of the situations into stages. 
We take a Bayesian approach to this problem and choose the model with the highest posterior 
probability — the Maximum A Posteriori (MAP) model. This is the simplest and possibly most 
common Bayesian model selection method, advocated by, for example, Dennison et al. [6], Castelo,
and Heckerman, the latter two specifically for Bayesian network selection.

The paper is structured as follows. In the next section we review the definitions of event trees 
and CEGs. In Section 3 we develop the theory of how conjugate learning of CEGs is performed. In
Section 4 we apply this theory by using the posterior probability of a CEG as its score in a model
search algorithm that is derived using an analogous procedure to the model selection of BNs. We 
characterise the product Dirichlet distribution as a prior distribution for the CEGs' parameters 
under particular homogeneity conditions. In Section [5] the algorithm is used to discover a good 
explanatory model for real students' exam results. We finish with a discussion. 

2. Definitions of event trees and chain event graphs 

In this section we briefly define the event tree and chain event graph. We refer the interested 
reader to [4] for further discussion and more detail concerning their construction. Bayesian networks,
which will be referenced throughout the paper, have been defined many times before. See Q for 
an overview. 

2.1. Event Trees 

Let $T = (V(T), E(T))$ be a directed tree where $V(T)$ is its node set and $E(T)$ its edge set. Let
$S(T) = \{v : v \in V(T) \setminus L(T)\}$ be the set of situations of $T$, where $L(T)$ is the set of leaf (or
terminal) nodes. Furthermore, define $\mathbb{X} = \{\lambda(v_0, v) : v \in V(T) \setminus S(T)\}$, where $\lambda(a, b)$ is the path
from node $a$ to node $b$, and $v_0$ is the root node, so that $\mathbb{X}$ is the set of root-to-leaf paths of $T$.
Each element of $\mathbb{X}$ is called an atomic event, each one corresponding to a possible unfolding of
events through time by using the partial ordering induced by the paths. Let $\mathbb{X}(v)$ denote the set of
children of $v \in V(T)$. In an event tree, each situation $v \in S(T)$ has an associated random variable
$X(v)$ with sample space $\mathbb{X}(v)$, defined conditional on having reached $v$. The distribution of $X(v)$ is
determined by the primitive probabilities $\{\pi(v'|v) = p(X(v) = v') : v' \in \mathbb{X}(v)\}$. With random
variables on the same path being mutually independent, the joint probability of events on a path
can be calculated by multiplying the appropriate primitive probabilities together. Each primitive
probability $\pi(v'|v)$ is a colour for the directed edge $e = (v, v')$, so that we can have $\pi(e) = \pi(v'|v)$.

Example 2. Figure 2 shows a tree for two Bernoulli random variables, X and Y, with X occurring
before Y . In an educational example X could be the indicator variable of a student passing one 
module, and Y the indicator variable for a subsequent module. 
Figure 2: Simple event tree. The non-zero-probability events in the joint probability distribution of two Bernoulli
random variables, X and Y, with X observed before Y, can be represented by this tree. Here, all four joint states
are possible, because there are four root-to-leaf paths through the nodes.




Figure 3: Floret of $v$. This subtree represents both the random variable $X(v)$ and its state space $\mathbb{X}(v)$.



Here we have random variables $X(v_0) = X$, $X(v_1) = Y | (X = 0)$ and $X(v_2) = Y | (X = 1)$,
and primitive probabilities $\pi(v_1|v_0) = p(X = 0)$, $\pi(v_3|v_1) = p(Y = 0 | X = 0)$ and so on for every
other edge. Joint probabilities can be found by multiplying primitive probabilities along a path,
e.g. $p(X = 0, Y = 0) = p(X = 0)\, p(Y = 0 | X = 0) = \pi(v_1|v_0)\, \pi(v_3|v_1)$, since $v_0$ and $v_1$ are on a path.
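
The multiplication rule can be illustrated with a short Python sketch of the tree in Example 2. The numerical edge probabilities, the leaf labels `v4`, `v5` and `v6`, and the two helper functions are hypothetical choices made only for illustration. The sketch enumerates the root-to-leaf paths (the atomic events) and multiplies primitive probabilities along each one.

```python
# Primitive probabilities pi(child | parent); the values are invented.
edge_prob = {
    ("v0", "v1"): 0.4,   # p(X = 0)
    ("v0", "v2"): 0.6,   # p(X = 1)
    ("v1", "v3"): 0.25,  # p(Y = 0 | X = 0)
    ("v1", "v4"): 0.75,  # p(Y = 1 | X = 0)
    ("v2", "v5"): 0.5,   # p(Y = 0 | X = 1)
    ("v2", "v6"): 0.5,   # p(Y = 1 | X = 1)
}


def root_to_leaf_paths(root, edge_prob):
    """Enumerate every path from the root to a leaf of the tree."""
    children = {}
    for parent, child in edge_prob:
        children.setdefault(parent, []).append(child)
    paths, stack = [], [[root]]
    while stack:
        path = stack.pop()
        kids = children.get(path[-1], [])
        if not kids:                      # no outgoing edges: a leaf
            paths.append(tuple(path))
        for kid in kids:
            stack.append(path + [kid])
    return paths


def path_probability(path, edge_prob):
    """Multiply the primitive probabilities along one path."""
    prob = 1.0
    for parent, child in zip(path, path[1:]):
        prob *= edge_prob[(parent, child)]
    return prob


paths = root_to_leaf_paths("v0", edge_prob)
atom_probs = {p: path_probability(p, edge_prob) for p in paths}
```

With these numbers the path through `v1` and `v3` has probability 0.4 times 0.25, and the four atomic probabilities sum to one, as they must for a distribution over root-to-leaf paths.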

2.2. Chain Event Graphs

Starting with an event tree $T$, define a floret of $v \in S(T)$ as

$$\mathcal{F}(v, T) = (V(\mathcal{F}(v, T)), E(\mathcal{F}(v, T)))$$

where $V(\mathcal{F}(v, T)) = \{v\} \cup \{v' \in V(T) : (v, v') \in E(T)\}$ and $E(\mathcal{F}(v, T)) = \{e \in E(T) : e = (v, v')\}$.
The floret of a vertex $v$ is thus a sub-tree consisting of $v$, its children, and the edges connecting
$v$ and its children, as shown in Figure 3. This represents, as defined in Section 2.1, the random
variable $X(v)$ and its sample space $\mathbb{X}(v)$.

One of the redundancies that can be eliminated from an ET is that of the florets' edges of two
situations, $v$ and $v'$ say, which have identical associated edge probabilities despite being defined by
different conditioning paths. We say these two situations are at the same stage. This concept is
formally defined as follows.

Definition 3. Two situations $v, v' \in S(T)$ are in the same stage $u$ if and only if $X(v)$ and $X(v')$
have the same distribution under a bijection

$$\psi_u(v, v') : E(\mathcal{F}(v, T)) \to E(\mathcal{F}(v', T)),$$

i.e.

$$\psi_u(v, v') : \mathbb{X}(v) \to \mathbb{X}(v').$$

The set of stages of an ET $T$ is written $J(T)$. This set partitions the set of situations $S(T)$.
We can construct a staged tree $\mathcal{G}(T, L(T))$ with $V(\mathcal{G}) = V(T)$, $E(\mathcal{G}) = E(T)$, and colour its
edges such that:

• If $v \in u$ and $u$ contains no other vertices, then all $(v, v^*) \in E(\mathcal{G})$ are left uncoloured;

• If $v \in u$ and $u$ contains other vertices, then all $(v, v^*) \in E(\mathcal{G})$ are coloured; and

• Whenever $e(v, v^*) \leftrightarrow e(v', v'^*)$ under $\psi_u(v, v')$, then the two edges must have the same colour.

There is another type of situation that is of further interest. When the whole developments from two
situations $v$ and $v'$ have identical distributions, i.e. there exists a bijection between their respective
subtrees similar to that between stages as defined in Definition 3, then the situations are said to
be in the same position. This is defined formally as follows.

Definition 4. Two situations $v, v' \in S(T)$ are in the same position $w$ if and only if there exists a
bijection

$$\phi_w(v, v') : \Lambda(v, T) \to \Lambda(v', T)$$

where $\Lambda(v, T)$ is the set of paths in $T$ from $v$ to a leaf node of $T$, such that

• all edges in all of the paths in $\Lambda(v, T)$ and $\Lambda(v', T)$ are coloured in $\mathcal{G}(T, L(T))$; and

• for every path $\lambda(v) \in \Lambda(v, T)$, the ordered sequence of colours in $\lambda(v)$ equals the ordered
sequence of colours in $\lambda(v') := \phi_w(v, v')(\lambda(v)) \in \Lambda(v', T)$.

This ensures that when $v$ and $v'$ are in the same position, then under the map $\phi_w(v, v')$ future
development from either node follows identical probability distributions.

We denote the set of positions as $K(T)$. Positions are an obvious way of equating situations,
because the different conditioning variables of different nodes in the same position have no effect
on any subsequent development. It is clear that $K(T)$ is a finer partition of $V(T)$ than $J(T)$, and
indeed that $J(T)$ partitions $K(T)$, as situations in the same position will also be in the same stage.

We now use stages and positions to compress the event tree into a chain event graph. First, the
probability graph of the event tree

$$\mathcal{H}(\mathcal{G}(T)) = \mathcal{H}(T) = (V(\mathcal{H}), E(\mathcal{H}))$$

is drawn, where $V(\mathcal{H}) = K(T) \cup \{w_\infty\}$ and $E(\mathcal{H})$ is constructed as follows.

• For each pair of positions $w, w' \in K(T)$, if there exist $v, v' \in S(T)$ such that $v \in w$, $v' \in w'$
and $e(v, v') \in E(T)$, then an associated edge $e(w, w') \in E(\mathcal{H})$ is drawn. Furthermore, if for
a position $w$ there exist $v \in S(T)$, $v' \in L(T)$ and $e(v, v') \in E(T)$ such that $v \in w$, then an
associated edge $e(w, w_\infty) \in E(\mathcal{H})$ is drawn.

• The colour of this edge, $e(w, w')$, is the same as the colour of the associated edge $e(v, v')$.

Now the CEG can finally be constructed by taking the probability graph $\mathcal{H}(T)$ and connecting
the positions that are in the same stage using undirected edges: let $C(T)$ be a mixed graph with
vertex set $V(C) = V(\mathcal{H})$, directed edge set $E_d(C) = E(\mathcal{H})$, and undirected edge set
$E_u(C) = \{(w, w') : u(w) = u(w'),\ w, w' \in V(C)\}$.

An example of a CEG that could be constructed from the event tree in Figure 1 is shown in
Figure 15.11 
3. Conjugate learning of CEGs

One convenient property of CEGs is that conjugate updating of the model parameters proceeds 
in a closely analogous fashion to that on a BN. Conjugacy is a crucial part of the model selection 
algorithm that will be described in Section 4 because it leads to closed form expressions for the
posterior probabilities of candidate CEGs. This in turn makes it possible to search the often very 
large model space quickly to find optimal models. We demonstrate here how a conjugate analysis 
on a CEG proceeds. 

Let a CEG $C$ have set of stages $J(C) = \{u_1, \ldots, u_k\}$, and let each stage $u_i$ have $k_i$ emanating
edges (labelled $e_{i1}, \ldots, e_{ik_i}$) with associated probability vector $\pi_i = (\pi_{i1}, \pi_{i2}, \ldots, \pi_{ik_i})'$ (where
$\sum_{j=1}^{k_i} \pi_{ij} = 1$ and $\pi_{ij} > 0$ for $j \in \{1, \ldots, k_i\}$). Then, under random sampling, the likelihood of the
CEG can be decomposed into a product of the likelihood of each probability vector, i.e.

$$p(x | \pi, C) = \prod_{i=1}^{k} p_i(x_i | \pi_i, C)$$

where $\pi = \{\pi_1, \pi_2, \ldots, \pi_k\}$, and $x = \{x_1, \ldots, x_k\}$ is the complete sample data such that each
$x_i = (x_{i1}, \ldots, x_{ik_i})'$ is the vector of the number of units in the sample (for example, the students in
Example 1) that start in stage $u_i$ and move to the stage at the end of edge $e_{ij}$ for $j \in \{1, \ldots, k_i\}$.
If it is further assumed that $x_i \perp\!\!\!\perp x_j \mid \pi$, $\forall i \neq j$, then

$$p_i(x_i | \pi_i, C) = \prod_{j=1}^{k_i} \pi_{ij}^{x_{ij}} \qquad (1)$$

Thus, just as for the analogous situation with BNs, the likelihood of a random sample also separates
over the components of $\pi$. With BNs, a common modelling assumption is of local and global
independence of the probability parameters; the corresponding assumption here is that the
parameters $\pi_1, \pi_2, \ldots, \pi_k$ of $\pi$ are all mutually independent a priori. It will then follow, with the
separable likelihood, that they will also be independent a posteriori.

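If the probabilities $\pi_i$ are assigned a Dirichlet distribution, $\mathrm{Dir}(\alpha_i)$, a priori, where
$\alpha_i = (\alpha_{i1}, \alpha_{i2}, \ldots, \alpha_{ik_i})'$, then for values of $\pi_{ij}$ such that $\sum_{j=1}^{k_i} \pi_{ij} = 1$ and $\pi_{ij} > 0$ for $1 \le j \le k_i$, the
density of $\pi_i$, $q_i(\pi_i | C)$, can be written

$$q_i(\pi_i | C) = \frac{\Gamma(\alpha_{i1} + \ldots + \alpha_{ik_i})}{\Gamma(\alpha_{i1}) \cdots \Gamma(\alpha_{ik_i})} \prod_{j=1}^{k_i} \pi_{ij}^{\alpha_{ij} - 1}$$

where $\Gamma(z) = \int_0^\infty t^{z-1} e^{-t}\, dt$ is the Gamma function. It then follows that $\pi_i | x$ ($= \pi_i | x_i$) also has
a Dirichlet distribution, $\mathrm{Dir}(\alpha_i^*)$, a posteriori, where $\alpha_i^* = (\alpha_{i1}^*, \ldots, \alpha_{ik_i}^*)'$, $\alpha_{ij}^* = \alpha_{ij} + x_{ij}$ for
$1 \le j \le k_i$, $1 \le i \le k$. The marginal likelihood of this model can be written down explicitly as the
function of the prior and posterior Dirichlet parameters:

$$p(x | C) = \prod_{i=1}^{k} \left[ \frac{\Gamma(\sum_j \alpha_{ij})}{\Gamma(\sum_j \alpha_{ij}^*)} \prod_{j=1}^{k_i} \frac{\Gamma(\alpha_{ij}^*)}{\Gamma(\alpha_{ij})} \right]$$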
The computationally more useful logarithm of the marginal likelihood is therefore a linear combination
of functions of $\alpha_{ij}$ and $\alpha_{ij}^*$. Explicitly,

$$\log p(x | C) = \sum_{i=1}^{k} \left[ s(\alpha_i) - s(\alpha_i^*) \right] + \sum_{i=1}^{k} \left[ t(\alpha_i^*) - t(\alpha_i) \right]$$

where for any vector $c = (c_1, c_2, \ldots, c_n)'$,

$$s(c) = \log \Gamma\Big(\sum_{v=1}^{n} c_v\Big) \quad \text{and} \quad t(c) = \sum_{v=1}^{n} \log \Gamma(c_v) \qquad (3)$$
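
The log marginal likelihood can be evaluated directly from the functions $s$ and $t$. The following is a minimal sketch using only the Python standard library; the function name `log_marginal_likelihood` and the toy hyperparameters and counts are illustrative assumptions, not values from the paper.

```python
from math import lgamma


def s(c):
    """s(c) = log Gamma(sum of the entries of c)."""
    return lgamma(sum(c))


def t(c):
    """t(c) = sum of log Gamma(c_v) over the entries of c."""
    return sum(lgamma(v) for v in c)


def log_marginal_likelihood(alphas, counts):
    """Sum over stages of s(alpha) - s(alpha*) + t(alpha*) - t(alpha).

    alphas[i] is the prior Dirichlet vector of stage i and counts[i] the
    vector of observed edge counts for that stage.
    """
    total = 0.0
    for alpha, x in zip(alphas, counts):
        alpha_star = [a + n for a, n in zip(alpha, x)]  # conjugate update
        total += s(alpha) - s(alpha_star) + t(alpha_star) - t(alpha)
    return total


# One stage with a uniform Dir(1, 1) prior and a single observation.
one_obs = log_marginal_likelihood([(1, 1)], [(1, 0)])
# The same stage after one unit is observed on each edge.
two_obs = log_marginal_likelihood([(1, 1)], [(1, 1)])
```

For the toy case, one observation under a Dir(1, 1) prior has marginal probability 1/2, and one observation on each edge has marginal probability 1/2 times 1/3, which the function reproduces on the log scale.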

So the posterior probability of a CEG $C$ after observing $x$, $q(C | x)$, can be calculated using
Bayes' Theorem, given a prior probability $q(C)$:

$$\log q(C | x) = \log p(x | C) + \log q(C) + K \qquad (4)$$

for some value $K$ which does not depend on $C$. This is the score that will be used when searching
over the candidate set of CEGs for the model that best describes the data.



4. A Local Search Algorithm for Chain Event Graphs 

4.1. Preliminaries

With the log marginal posterior probability of a CEG model, $\log q(C | x)$, as its score, searching
for the highest-scoring CEG in the set of all candidate models is equivalent to trying to find the
Maximum A Posteriori (MAP) model. The intuitive approach for searching $\mathcal{C}$, the candidate set
of CEGs, is to calculate $q(C | x)$ (or $\log q(C | x)$) for every $C \in \mathcal{C}$ and choose
$C^* := \arg\max_C q(C | x) = \arg\max_C \log q(C | x)$; this is infeasible for any but the most trivial problems. We describe in this section
an algorithm for efficiently searching the model space by reformulating the model search problem
as a clustering problem.

As mentioned in Section 2.2, every CEG that can be formed from a given event tree can be
identified exactly with a partition of the event tree's nodes into stages. The coarsest partition $C_\infty$
has all nodes with $k$ outgoing edges in the same stage, $u_k$; the finest partition $C_0$ has each situation
in its own stage, except for the trivial cases of those nodes with only one outgoing edge. Defined
this way, the search for the highest-scoring CEG is equivalent to searching for the highest-scoring
clustering of stages.
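
To give a feel for the size of this search space, the sketch below counts the stagings of a tree under the assumption that any set partition of the situations sharing an out-degree is a candidate, so that the count is a product of Bell numbers. The function names and the list of out-degrees (one reading of Example 1) are illustrative assumptions.

```python
from collections import Counter


def bell(n):
    """Number of set partitions of n items, computed with the Bell triangle."""
    row = [1]
    for _ in range(n):
        new_row = [row[-1]]
        for value in row:
            new_row.append(new_row[-1] + value)
        row = new_row
    return row[0]


def count_stagings(out_degrees):
    """Count partitions of situations into stages, grouping by out-degree.

    Situations with a single outgoing edge are trivial and are ignored.
    """
    groups = Counter(d for d in out_degrees if d > 1)
    total = 1
    for size in groups.values():
        total *= bell(size)
    return total


# Assumed out-degrees for the 11 situations of Example 1:
# three situations with two edges and eight with three edges.
example_degrees = [2, 3, 3, 2, 2, 3, 3, 3, 3, 3, 3]
n_models = count_stagings(example_degrees)
```

Even for this small tree the count runs into the tens of thousands, and Bell numbers grow faster than exponentially, which is why exhaustive scoring quickly becomes impractical.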

Various Bayesian clustering algorithms exist, including many involving MCMC. We show
here how to implement an exact Bayesian agglomerative hierarchical clustering (AHC) algorithm
related to that of Heard et al. [13]. The AHC algorithm here is a local search algorithm that begins
with the finest partition of the nodes of the underlying ET model (called $C_0$ above and henceforth)
and seeks at each step to find the two nodes that will yield the highest-scoring CEG if combined.

Some optional steps can be taken to simplify the search, which we will implement here. The
first of these involves the calculation of the scores of the proposed models in the algorithm. By
assuming that the probability distributions of stages that are formed from the same nodes of the
underlying ET are equal in all CEGs, i.e. $p(x_i | \pi_i, C_1) = p(x_i | \pi_i, C_2)$, $\forall C_1, C_2 \in \mathcal{C}$, it becomes
more efficient to calculate the differences of model scores, i.e. the logarithms of the relevant Bayes
factors, than to calculate the two individual model scores absolutely. This is because, if for two
CEGs their stage sets $J(C_1)$ and $J(C_2)$ differ only in that stages $u_{1a}, u_{1b} \in C_1$ are combined into
$u_{2c} \in C_2$, with all other stages unchanged, then the calculation of the logarithm of their posterior
Bayes factor depends only on the stages involved; using the notation of Equation (3),

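$$\begin{aligned}
\log \frac{q(C_1 | x)}{q(C_2 | x)} &= \log q(C_1 | x) - \log q(C_2 | x) && (5) \\
&= \log q(C_1) - \log q(C_2) + \log p(x | C_1) - \log p(x | C_2) && (6) \\
&= \log q(C_1) - \log q(C_2) + \sum_i \left[ s(\alpha_{1i}) - s(\alpha_{1i}^*) \right] + \sum_i \left[ t(\alpha_{1i}^*) - t(\alpha_{1i}) \right] \\
&\quad - \sum_i \left[ s(\alpha_{2i}) - s(\alpha_{2i}^*) \right] - \sum_i \left[ t(\alpha_{2i}^*) - t(\alpha_{2i}) \right] && (7) \\
&= \log q(C_1) - \log q(C_2) + s(\alpha_{1a}) - s(\alpha_{1a}^*) + t(\alpha_{1a}^*) - t(\alpha_{1a}) \\
&\quad + s(\alpha_{1b}) - s(\alpha_{1b}^*) + t(\alpha_{1b}^*) - t(\alpha_{1b}) \\
&\quad - s(\alpha_{2c}) + s(\alpha_{2c}^*) - t(\alpha_{2c}^*) + t(\alpha_{2c}) && (8)
\end{aligned}$$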

Using the trivial result that for any three CEGs

$$\log q(C_3 | x) - \log q(C_2 | x) = \left[ \log q(C_3 | x) - \log q(C_1 | x) \right] - \left[ \log q(C_2 | x) - \log q(C_1 | x) \right],$$

it can be seen that in the course of the AHC algorithm, comparing two proposal CEGs from the
current CEG can be done equivalently by comparing their log Bayes factors with the current CEG,
which as shown above requires fewer calculations.
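
A greedy agglomerative search built on these local comparisons might be sketched as follows. This is an illustration under stated assumptions rather than the authors' implementation: the prior over CEGs is taken to be uniform, the hyperparameters of a merged stage are taken to be the sum of those of its components, and the search stops when no merge improves the score. The function names and the toy data are hypothetical. Note that `gain` is the log Bayes factor of the merged model against the current one, i.e. the negative of the quantity in Equation (8).

```python
from itertools import combinations
from math import lgamma


def stage_log_ml(alpha, counts):
    """Contribution of one stage to the log marginal likelihood."""
    post = [a + n for a, n in zip(alpha, counts)]
    return (lgamma(sum(alpha)) - lgamma(sum(post))
            + sum(lgamma(p) - lgamma(a) for p, a in zip(post, alpha)))


def greedy_ahc(stages):
    """Repeatedly merge the pair of stages with the largest positive gain.

    stages maps a stage name to (alpha, counts), both tuples of equal length.
    """
    stages = dict(stages)
    while True:
        best = None
        for a, b in combinations(sorted(stages), 2):
            (alpha_a, x_a), (alpha_b, x_b) = stages[a], stages[b]
            if len(alpha_a) != len(alpha_b):
                continue  # only stages with equal numbers of edges can merge
            alpha_m = tuple(p + q for p, q in zip(alpha_a, alpha_b))
            x_m = tuple(p + q for p, q in zip(x_a, x_b))
            # Only the stages involved change, so the gain is local.
            gain = (stage_log_ml(alpha_m, x_m)
                    - stage_log_ml(alpha_a, x_a)
                    - stage_log_ml(alpha_b, x_b))
            if best is None or gain > best[0]:
                best = (gain, a, b, alpha_m, x_m)
        if best is None or best[0] <= 0:  # assumed stopping rule
            return stages
        _, a, b, alpha_m, x_m = best
        del stages[a], stages[b]
        stages[a + "+" + b] = (alpha_m, x_m)


# Toy data: s1 and s2 behave alike, s3 clearly differs.
toy = {
    "s1": ((1, 1), (50, 50)),
    "s2": ((1, 1), (48, 52)),
    "s3": ((1, 1), (95, 5)),
}
result = greedy_ahc(toy)
```

On the toy data the two stages with similar counts are merged and the third is left alone. Because each comparison touches only the two stages being combined, the cost per step depends on the number of candidate pairs and not on the size of the whole CEG.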

The calculation of the score for each CEG $C$, as shown by Equation (4), shows that it is formed of
two components: the prior probability of the CEG being the true model and the marginal likelihood 
of the data. These must therefore be set before the algorithm can be run, and it is here that the 
other simplifications are made. 



4.2. The prior over the CEG space

For any practical problem $\mathcal{C}$, the set of all possible CEGs for a given ET, is likely to be a very
large set, making setting a value for $q(C)$, $\forall C \in \mathcal{C}$, a non-trivial task. An obvious way to set a
non-informative or exploratory prior is to choose the uniform prior, so that $q(C) = 1/|\mathcal{C}|$. This has
the advantages of being simple to set and of eliminating the $\log q(C_1) - \log q(C_2)$ term from the
log posterior Bayes factor.

A more sophisticated approach is to consider which potential clusters are more or less likely
a priori, according to structural or causal beliefs, and to exploit the modular nature of CEGs by
stating that the prior log Bayes factor of a CEG relative to $C_0$ is the sum of the prior log Bayes
factors of the individual clusters relative to their components completely unclustered, and that
these priors are modular across CEGs. This approach makes it simple to elicit priors over $\mathcal{C}$ from
a lay expert, by requiring the elicitation only of the prior probability of each possible stage.

A particular computational benefit of this approach arises when the prior Bayes factor of any CEG
$C$ with $C_0$ is believed to be zero, because one or more of its clusters is considered to be impossible.
This is equivalent in the algorithm to not including the CEG in its search at all, as though it was
never in $\mathcal{C}$ in the first place, with the obvious simplification of the search following.
4.3. The prior over the parameter space

Just as when attempting to set $q(C)$, the size of most CEGs in practice leads to intractability of
setting $p(x | C)$ for each CEG $C$ individually. However, the task is again made possible by exploiting
the structure of a CEG with judicious modelling assumptions.

Assuming independence between the likelihoods of the stages for every CEG, so that $p(x | \pi, C)$ is
as determined by Equation (1), and using the fact that $p(x | C) = \int p(x | \pi, C)\, p(\pi | C)\, d\pi$, it is clear that to
set the marginal likelihood for each CEG is equivalent to setting the prior over the CEG's parameters,
i.e. setting $p(\pi | C)$ for each $C$. With the two further structural assumptions that the stage priors
are independent for all CEGs (so that $p(\pi | C) = \prod_{i=1}^{k} p(\pi_i | C)$) and that equivalent stages in different
CEGs have the same prior distributions on their probability vectors (i.e. $p(\pi_i | C_1) = p(\pi_i | C_2)$),
it can be seen that the problem of setting $p(x | \pi, C)$ is reduced to setting the parameter priors of
each non-trivial floret in $C_0$ ($p(\pi_i | C_0)$, $i = 1, \ldots, k$) and the parameter priors of stages that are
clusters of stages of $C_0$.

The usual prior put on the probability parameters of finite discrete BNs is the product Dirichlet
distribution. In Geiger and Heckerman [14] the surprising result was shown that a product Dirichlet
prior is inevitable if local and global independence are assumed to hold over all Markov equivalent
graphs on at least two variables. In this paper we show that a similar characterisation can be
made for CEGs given the assumptions in the previous paragraph. We will first show that the floret
parameters in $C_0$ must have Dirichlet priors, and second that all CEGs formed by clustering the
florets in $C_0$ have Dirichlet priors on the stage parameters. One characterisation of $C_0$ is given by
Theorem 5.

Theorem 5. If it is assumed a priori that the rates at which units take the root-to-leaf paths in
$C_0$ are independent ("path independence") and that the probability of which edge units take after
arriving at a situation $v$ is independent of the rate at which units arrive at $v$ ("floret independence"),
then the non-trivial florets of $C_0$ have independent Dirichlet priors on their probability vectors.

Proof. The proof is in the Appendix.

Thus $p(\pi_i | C_0)$ is entirely determined by the stated rates $\gamma(\lambda)$ on the root-to-leaf paths $\lambda \in \Lambda(C_0)$
of $C_0$. This is similar to the "equivalent sample sizes" method of assessing prior uncertainty of
Dirichlet hyperparameters in BNs as discussed in Section 2 of Heckerman.

Another way to show that all non-trivial situations in $C_0$ have Dirichlet priors on their parameter
spaces is to use the characterisation of the Dirichlet distribution first proven by Geiger and
Heckerman [13], repeated here as Theorem 6.

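Theorem 6. Let $\{\theta_{ij}\}$, $1 \le i \le k$, $1 \le j \le n$, $\sum_{i,j} \theta_{ij} = 1$, where $k$ and $n$ are integers greater
than 1, be positive random variables having a strictly positive pdf $f_U(\{\theta_{ij}\})$. Define $\theta_{i\cdot} = \sum_{j=1}^{n} \theta_{ij}$
and $\theta_{j|i} = \theta_{ij} / \theta_{i\cdot}$, and write $\theta_{I\cdot}$ for the collection of the $\theta_{i\cdot}$ and $\theta_{J|i}$ for the collection of the $\theta_{j|i}$ with $i$ fixed.

Then if $\{\theta_{I\cdot}, \theta_{J|1}, \ldots, \theta_{J|k}\}$ are mutually independent, $f_U(\{\theta_{ij}\})$ is Dirichlet.
Proof. Theorem 2 of Geiger and Heckerman.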

Corollary 7. If $C_0$ has a composite number $m$ of root-to-leaf paths and all Markov equivalent CEGs
have independent floret distributions then the vector of probabilities on the root-to-leaf paths of $C_0$
must have a Dirichlet prior. This means in particular that, from the properties of the Dirichlet
distribution, the floret of each situation with at least two outgoing edges has a Dirichlet prior on its
edges.
Proof. Construct an event tree $C_0'$ with $m$ root-to-leaf paths, where the floret of the root node
$v_0'$ has $k$ edges and each of the florets extending from the children of $v_0'$ has $n$ edges terminating
in leaf nodes, where $m = kn$, $k \ge 2$, $n \ge 2$. This will always be possible with a composite $m$. $C_0'$
describes the same atomic events as $C_0$ with a different decomposition.

Let the random variable associated with the root floret of $C_0'$ be $X$, and let the random variable
associated with each of the other florets be $Y | X = i$, $i = 1, \ldots, k$. Let $\theta_{ij} = P(X = i, Y = j)$.
Then by the definition of event trees, $P(\theta_{ij} > 0) > 0$, $1 \le i \le k$, $1 \le j \le n$, and $\sum_{i,j} \theta_{ij} = 1$. By the
notation of Theorem 6, $\theta_{i\cdot} = P(X = i)$ and $\theta_{j|i} = P(Y = j | X = i)$.

By hypothesis the floret distributions of $C_0'$ are independent. Therefore the condition of Theorem
6 holds and hence $f_U(\{\theta_{ij}\})$ is Dirichlet. From the equivalence of the atomic events, the probability
distribution over the root-to-leaf path probabilities of $C_0$ is also Dirichlet, and so, from the properties
of the Dirichlet distribution, all non-trivial florets of $C_0$ have Dirichlet priors on their probability vectors.

To show that the stage parameters of all the other CEGs in $\mathcal{C}$ have independent Dirichlet priors,
an inductive approach will be taken. Because of the assumption of consistency (that two identically
composed stages in different CEGs have identical priors on their parameter space), for any given
CEG $C$ whose stages all have independent Dirichlet priors on their parameter spaces, it is known
that another CEG $C^*$ formed by clustering two stages $u_{1c}$, $u_{2c}$ from $C$ into one stage $u_{c^*}$ will have
independent Dirichlet priors on all its stages apart from $u_{c^*}$. It is thus only required to show that
$\pi_{c^*}$ has a Dirichlet prior. We prove this result for a class of CEGs called regular CEGs.

Definition 8. A stage $u$ is regular if and only if every path $\lambda \in \Lambda(C)$ contains either one
situation in $u$ or none of the situations in $u$.

Definition 9. A CEG is regular if and only if every stage $u \in J(C)$ is regular.

Theorem 10. Let $C$ be a regular CEG, and let $C^*$ be the CEG that is formed from $C$ by setting
two of its stages, $u_{1c}$ and $u_{2c}$, as being in the same stage $u_{c^*}$, where $u_{c^*}$ is a regular stage, with all
other attributes of the CEG unchanged from $C$.

If all stages in $C$ have Dirichlet priors, then assuming that equivalent stages in different CEGs
have equivalent priors, all stages in $C^*$ have Dirichlet priors.

Proof. Without loss of generality, let all situations in $u_{1c}$ and $u_{2c}$ have $s$ children each, and let
the total number of situations in $u_{1c}$ and $u_{2c}$ be $r$. Thus there are $r$ situations in $u_{c^*}$, each with
$s$ children. By the assumption of prior consistency across stages, all stages in $C^*$ other than $u_{c^*}$ have Dirichlet
priors on their parameter spaces, so it is only required to prove that $u_{c^*}$ has a Dirichlet prior.

Consider the CEG $C'$ formed as follows: let the root node of $C'$, $v_0$, have 2 children, $v_1$ and
$v'$. Let $v'$ be a terminal node, and let $v_1$ have $r$ children, $\{v_1(1), \ldots, v_1(r)\}$, which are equivalent
to the situations in $u_{c^*}$, including the property that they are in the same stage $u_{c'}$. Lastly, let the
children of $\{v_1(1), \ldots, v_1(r)\}$, namely $\{v_1(1, 1), \ldots, v_1(1, s), \ldots, v_1(r, 1), \ldots, v_1(r, s)\}$, be leaf nodes in $C'$.

By construction, the prior for $u_{c'}$ is the same as that for $u_{c^*}$.

Now construct another CEG $C''$ from $C'$ by reversing the order of the stages $v_1$ and $u_{c'}$. The
new CEG has root node $v_0'$ with the same distribution as $v_0 \in C'$. $v_0'$ now has two children: $v'$,
the same as before, and $v_2$, which has $s$ children $\{v_2(1), \ldots, v_2(s)\}$ in the same stage. Each node
$v_2(i)$, $i = 1, \ldots, s$, has $r$ children $v_2(i, 1), \ldots, v_2(i, r)$, all of which are leaf nodes.

The two CEGs $C'$ and $C''$ are Markov equivalent, as it is clear that $P(v_1(i, j)) = P(v_2(j, i))$, $i = 1, \ldots, r$,
$j = 1, \ldots, s$. The probabilities on the floret of $v_2$ are thus equal to the probabilities of the
situations in the stage $u_{c'}$, and hence $u_{c^*}$. Because $v_2$ is a stage with only one situation, Theorem
5 implies that it has a Dirichlet prior. Therefore $u_{c^*}$ has a Dirichlet prior.
An alternative justification for assigning a Dirichlet prior to any stage that is formed by clustering
situations with Dirichlet priors on their state spaces can be obtained which does not depend
on assuming Markov equivalency between CEGs derived from different event trees, by assuming a
property analogous to that of "parameter modularity" for BNs [15]. This property states that the
distribution over structures common to two CEGs should be identical.

Definition 11. Let $u$ be a stage in a CEG $C$ composed of the situations $v_1, \ldots, v_n$ from $C_0$, each
of which has $m$ children $v_{i1}, \ldots, v_{im}$, $i = 1, \ldots, n$, such that $v_{ij}$ are the same colour for all $i$ for each
$j$. Then $u$ has the property of margin equivalency if

$$\pi_{uj} = P(v_{1j} \text{ or } v_{2j} \text{ or } \ldots \text{ or } v_{nj} \mid v_1 \text{ or } v_2 \text{ or } \ldots \text{ or } v_n) \qquad (9)$$

is the same for both $C$ and $C_0$ for $j = 1, \ldots, m$.

Definition 12. $C$ has margin equivalency if all of its stages have margin equivalency.

Theorem 13. Let $u$ be a stage as defined in Definition 11 with $m \geq 2$. Then, assuming independent
priors between the situations for the associated finest-partition CEG $C_0$ of $C$, $\pi_{v_i} \sim \mathrm{Dir}(\alpha_i)$, where
$\alpha_i = (\alpha_{i1}, \ldots, \alpha_{im})$ for each $v_i$, $i = 1, \ldots, n$. Furthermore, for both $C$ and $C_0$, $\pi_u \sim \mathrm{Dir}(\alpha_u)$,
where $\alpha_u = (\sum_i \alpha_{i1}, \ldots, \sum_i \alpha_{im})$.

Proof. From Theorem 5 or Corollary 7, every non-trivial floret in $C_0$ has a Dirichlet prior on
its edges, which includes in this case the situations $v_1, \ldots, v_n$.

Let $\gamma_{ij} = \gamma_i \pi_{ij}$ for $i = 1, \ldots, n$, $j = 1, \ldots, m$, for some $\gamma_i \in \mathbb{R}^{+}$. Then it is a well-known fact
that $\gamma_{ij} \sim \mathrm{Gamma}(\alpha_{ij}, \beta)$ for all $1 \leq i \leq n$, $1 \leq j \leq m$ for some $\beta > 0$, and that $\perp\!\!\!\perp_{j} \gamma_{ij}$. As
$\perp\!\!\!\perp_{i} \pi_{v_i}$, $\perp\!\!\!\perp_{ij} \gamma_{ij}$. Then by Lemma 15, letting $I[j]$ be the set of edges $\{e_{ij} = e(v_i, v_{ij}), i = 1, \ldots, n\}$
for $j = 1, \ldots, m$,

$$\pi_u \sim \mathrm{Dir}\Big(\sum_i \alpha_{i1}, \ldots, \sum_i \alpha_{im}\Big).$$

By margin equivalency, $\pi_u$ must be set the same way for $C$.
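
As a small illustration of Theorem 13, consider a stage $u$ that clusters $n = 2$ situations, each with $m = 2$ edges and a Beta(1,3) prior, that is, $\mathrm{Dir}(1, 3)$. These values are chosen only for illustration; they correspond to the two-edge florets discussed in Section 5.1.

```latex
\alpha_1 = (1, 3), \quad \alpha_2 = (1, 3)
\quad\Longrightarrow\quad
\alpha_u = \Big(\sum_i \alpha_{i1},\; \sum_i \alpha_{i2}\Big) = (1 + 1,\; 3 + 3) = (2, 6),
\qquad
\pi_u \sim \mathrm{Dir}(2, 6) = \mathrm{Beta}(2, 6).
```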

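Note that the posterior of $\pi_u$ for a stage $u$ that is composed of the $C_0$ situations $v_1, \ldots, v_n$
is thus $\pi_u \mid x \sim \mathrm{Dir}(\alpha_u^*)$, where $\alpha_u^* = \alpha_u + x_u = \sum_{i=1}^{n} (\alpha_i + x_i)$. Equation (5), therefore,
becomes

$$\begin{aligned}
\log \frac{q(C_1 \mid x)}{q(C_2 \mid x)} = {} & \log q(C_1) - \log q(C_2) + s(\alpha_{1a}) - s(\alpha_{1a}^*) + t(\alpha_{1a}^*) - t(\alpha_{1a}) \\
& + s(\alpha_{1b}) - s(\alpha_{1b}^*) + t(\alpha_{1b}^*) - t(\alpha_{1b}) - s(\alpha_{1a} + \alpha_{1b}) \\
& + s(\alpha_{1a}^* + \alpha_{1b}^*) - t(\alpha_{1a}^* + \alpha_{1b}^*) + t(\alpha_{1a} + \alpha_{1b}) \qquad (11)
\end{aligned}$$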
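
The log Bayes factor for keeping two situations apart versus merging them can be evaluated directly with log-gamma functions. The following is a minimal sketch, not the authors' implementation. It assumes the definitions $s(c) = \log\Gamma(\sum_i c_i)$ and $t(c) = \sum_i \log\Gamma(c_i)$, which give the standard Dirichlet-multinomial marginal likelihood; if the definitions given earlier in the paper differ, the code should be adapted. The function names and the counts are hypothetical.

```python
from math import lgamma

def s(c):
    """Assumed definition: s(c) = log Gamma(sum_i c_i)."""
    return lgamma(sum(c))

def t(c):
    """Assumed definition: t(c) = sum_i log Gamma(c_i)."""
    return sum(lgamma(ci) for ci in c)

def stage_score(alpha, counts):
    """One stage's term: s(alpha) - s(alpha*) + t(alpha*) - t(alpha),
    where alpha* = alpha + counts is the posterior parameter vector."""
    alpha_star = [a + x for a, x in zip(alpha, counts)]
    return s(alpha) - s(alpha_star) + t(alpha_star) - t(alpha)

def log_bf_separate_vs_merged(alpha_a, x_a, alpha_b, x_b, log_prior_ratio=0.0):
    """log q(C1|x) - log q(C2|x), where C1 keeps the two situations in
    separate stages and C2 places them in one stage.
    log_prior_ratio is log q(C1) - log q(C2) (0 for a uniform model prior)."""
    merged_alpha = [a + b for a, b in zip(alpha_a, alpha_b)]
    merged_x = [a + b for a, b in zip(x_a, x_b)]
    return (log_prior_ratio
            + stage_score(alpha_a, x_a)
            + stage_score(alpha_b, x_b)
            - stage_score(merged_alpha, merged_x))

# Hypothetical counts for two two-edge florets, each with a Beta(1, 3) prior.
lbf = log_bf_separate_vs_merged([1, 3], [20, 80], [1, 3], [25, 75])
print(round(lbf, 3))
```

A negative value favours the merged model and a positive value favours keeping the situations in separate stages.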



4.4. The algorithm

The algorithm thus proceeds as follows: 
1. Starting with the initial ET model, form the CEG $C_0$ with the finest possible partition, where
all leaf nodes are placed in the terminal stage $u_\infty$ and all nodes with only one emanating edge
are placed in the same stage. Calculate $\log q(C_0 \mid x)$.

2. For each pair of situations $v_i, v_j \in C_0$ with the same number of edges, calculate $\log \frac{q(C_1^* \mid x)}{q(C_0 \mid x)}$,
where $C_1^*$ is the CEG formed by having $v_i, v_j$ in the same stage and keeping all others in their
own stage; do not calculate if $q(C_1^*) = 0$.

3. Let $C_1 = \arg\max_{C_1^*} \log \frac{q(C_1^* \mid x)}{q(C_0 \mid x)}$.

4. Now form the candidate $C_2^*$ for each pair of stages in $C_1$, except where $q(C_2^*) = 0$, and record
$C_2 = \arg\max_{C_2^*} q(C_2^* \mid x)$.

5. Continue for $C_3$, $C_4$ and so on until the coarsest partition $C_\infty$ has been reached.

6. Find $C = \arg\max_{C' \in \{C_0, C_1, \ldots, C_\infty\}} q(C' \mid x)$, and select this as the MAP model.

We note that the algorithm can also be run backwards, starting from $C_\infty$ and splitting one
cluster in two at each step. This has the advantage of making the identification of positions in the
MAP model easier.
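
The forward version of the search can be sketched in a few lines. This is an illustrative sketch under stated assumptions rather than the authors' code: it assumes a uniform prior over candidate models (so no candidate has prior probability zero), that the edges of any two situations with the same number of edges are listed in matching order, and that each stage contributes the usual Dirichlet-multinomial log marginal likelihood. The names `stage_score` and `greedy_map_search` and the data are hypothetical.

```python
from itertools import combinations
from math import lgamma

def stage_score(alpha, counts):
    """Log marginal likelihood of one stage under a Dirichlet(alpha) prior."""
    post = [a + x for a, x in zip(alpha, counts)]
    return (lgamma(sum(alpha)) - lgamma(sum(post))
            + sum(lgamma(p) for p in post) - sum(lgamma(a) for a in alpha))

def greedy_map_search(situations):
    """Agglomerative search over stagings, assuming a uniform model prior.

    `situations` maps a situation name to (alpha, counts). Returns the best
    staging visited (a list of stages, each a list of names) and its log score."""
    # Step 1: finest partition, one stage per situation.
    stages = [([name], list(alpha), list(counts))
              for name, (alpha, counts) in situations.items()]
    score = sum(stage_score(a, x) for _, a, x in stages)
    best_stages, best_score = [names for names, _, _ in stages], score

    while True:
        # Steps 2 and 4: score every merge of two stages with equally many edges.
        candidates = []
        for i, j in combinations(range(len(stages)), 2):
            (ni, ai, xi), (nj, aj, xj) = stages[i], stages[j]
            if len(ai) != len(aj):
                continue
            merged = (ni + nj,
                      [p + q for p, q in zip(ai, aj)],
                      [p + q for p, q in zip(xi, xj)])
            gain = (stage_score(merged[1], merged[2])
                    - stage_score(ai, xi) - stage_score(aj, xj))
            candidates.append((gain, i, j, merged))
        if not candidates:
            break  # Step 5: the coarsest partition has been reached.
        # Step 3: apply the best merge, even if it lowers the score.
        gain, i, j, merged = max(candidates, key=lambda c: c[0])
        stages = [st for k, st in enumerate(stages) if k not in (i, j)] + [merged]
        score += gain
        # Step 6: remember the best model seen along the path.
        if score > best_score:
            best_stages, best_score = [names for names, _, _ in stages], score
    return best_stages, best_score

# Hypothetical data: v1 and v2 behave alike, v3 does not.
data = {
    "v1": ([1, 1], [48, 52]),
    "v2": ([1, 1], [51, 49]),
    "v3": ([1, 1], [90, 10]),
}
stages, score = greedy_map_search(data)
print(stages)
```

As in the steps above, the search continues all the way to the coarsest partition and only then reports the best model visited, so a locally unhelpful merge does not stop it early.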



5. Examples 

5.1. Simulated data 

To demonstrate the efficacy of the algorithm described above, we first apply it to
simulated data for Example 1, where the CEG generating the data was known and as described
in Section 1. Figure 15.11 shows the number of students in the sample who reached each situation in
the tree.

In this complete dataset the progress of 1000 students has been tracked through the event tree. 
Half are assigned to take module A first and the other half B. By finding the MAP CEG model in 
the light of this data we may find out whether the three hypotheses posed in the introduction are 
valid. We repeat them here for convenience: 

1. The chances of doing well in the second component are the same whether the student passed 
first time or after a resit. 

2. The components A and B are equally hard. 

3. The distribution of marks for the second component is unaffected by whether students passed 
or got a distinction for the first component. 

We set uniform priors on the candidate CEGs and on the root-to-leaf paths of $C_0$, the finest partition
of the tree, for illustration purposes. The algorithm is then implemented as follows.

There are only two florets with two edges; with Beta(1,3) priors on each and a Beta(2,6) prior
on the combined stage, the log Bayes factor is -1.85. Carrying out similar calculations for all the
pairs of nodes with three edges, it is first decided to merge the nodes $P_{1,a}$ and $P_{1,b}$, which has a
log Bayes factor of -3.76 against leaving them apart. Applying the algorithm to the updated set of
nodes and iterating, the CEG in Figure 5 is found to be the MAP one.

Under this model, it can be seen that all three hypotheses above are satisfied and that the MAP 
model is the correct one. 
Figure 5: The MAP CEG for the event tree in Figure [SH



5.2. Student test data 

In our second example we apply the learning algorithm to a real dataset in order to test the
algorithm's efficacy in a real-life situation and to identify remaining issues with its usage. The
dataset we used was an appropriately disguised set of marks taken over a 10-year period from four
core modules of the MORSE degree course taught at the University of Warwick. A part of the
event tree used as the underlying model for the first two modules is shown in Figure 6, along with
a few illustrative data points. This is a simplification of a much larger study that we are currently
investigating, but it is large enough to illustrate the richness of inference possible with our model search.

For simplicity, the prior distributions on the candidate models and on the root-to-leaf paths for
$C_0$ were both chosen to be uniform distributions.

The MAP CEG model was not $C_0$, so that there were some non-trivial stages. In total, 170
situations were clustered into 32 stages. Some of the more interesting stages of this model are
described in Table 1.



Stage | Probability vector | Students | Situations | Locations                  | Comments
------+--------------------+----------+------------+----------------------------+------------------------------------------------
7     | (0.47, 0.44, 0.08) | 685      | 2          | 1; 1,1,1                   | High achievers
11    | (0.22, 0.43, 0.35) | 412      | 6          | 3; 1,2; 3,1; 1,1,3         | Middling students
13    | (0.33, 0.33, 0.33) | 16       | 18         | 4; 4,2; 4,3                | No students appeared in 17 of these situations
17    | (0.07, 0.27, 0.66) | 86       | 4          | 1,3; 3,2; 3,2,4            | Struggling students
27    | (0.19, 0.56, 0.25) | 464      | 7          | 1,1,4; 1,2,2; 1,3,2; 1,4,2 | More likely to get grade 2 than stage 11
28    | (0.11, 0.51, 0.38) | 436      | 6          | 1,2,3; 3,1,3; 1,2,4        | More likely to get grade 3 than stage 27

Table 1: Selected stages of the MAP CEG model formed from the data described in Section 5.2. The columns respectively
detail the stage number, the posterior expectation of the probability vector of that stage (rounded to two decimal places),
the number of students passing through that stage in the dataset, the number of situations from the original ET in that
stage, examples of situations in that stage (shown as a sequence of grades, where "4" means that the grade is missing),
and any comments or observations related to that stage.
Figure 6: Sub-tree of the event tree of possible grades for the MORSE degree course at the University of Warwick.
Each floret of two edges describes whether a student's marks are available for a particular module (denoted by the
edge labelled A for the first module) or whether they are missing (NA). If they are available, then they are counted
as grade 1 if they are 70% or higher, grade 2 if they are between 50% and 69% inclusive, and grade 3 if they are below
50%. Some illustrative count data are shown on corresponding nodes.
From inspecting the membership of stages it was possible to identify various situations which
were discovered to share distributions. For example, students who reach one of the two situations
in stage 7 have an expected probability of 0.47 of getting a high mark, an expected probability of
0.44 of getting a middling grade, and an expected probability of only 0.08 of achieving the lowest
grade. From being in a stage of their own, it can be deduced that students in these situations
have qualitatively different prospects from students in any other situations. In contrast, students
who reach one of the four situations in stage 17 have an expected probability of 0.66 of getting the
lowest grade.
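
The expected probabilities quoted above are posterior means of a stage's Dirichlet distribution, which for prior parameters $\alpha$ and counts $x$ are $(\alpha_j + x_j) / \sum_k (\alpha_k + x_k)$. A minimal sketch follows; the prior values and counts are hypothetical and are not taken from the Warwick dataset.

```python
def posterior_mean(alpha, counts):
    """Posterior mean of a stage's probability vector under a Dir(alpha) prior."""
    post = [a + x for a, x in zip(alpha, counts)]
    total = sum(post)
    return [p / total for p in post]

# Hypothetical prior and counts of (grade 1, grade 2, grade 3) for one stage.
mean = posterior_mean([1.0, 1.0, 1.0], [40, 45, 12])
print([round(m, 2) for m in mean])  # [0.41, 0.46, 0.13]
```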

6. Discussion 

In this paper we have shown that chain event graphs are not just an efficient way of storing the 
information contained in an event tree, but also a natural way to represent the information that 
is most easily elicited from a domain expert: the order in which events happen, the distributions 
of variables conditional on the process up to the point they are reached, and prior beliefs about 
the relative homogeneity of different situations. This strength is exploited when the MAP CEG is 
discovered, as this can be used in a qualitative fashion to detect homogeneity between seemingly 
disparate situations. 

There are a number of extensions to the theory in this paper that are currently being pursued.
These fall mostly into two categories: creating even richer model classes than those considered
here; and developing even more efficient algorithms for selecting the MAP model in these model
classes.

The first category includes dynamic chain event graphs. This framework can supply a number of 
different model classes. The simplest case involves selecting a CEG structure that is constant across 
time, but with a time series on its parameters. A bigger class would allow the MAP CEG structure 
to change over time. These larger model classes would clearly be useful in the educational setting 
considered in this paper, as they would allow for background changes in the students' abilities, for 
example. 

Another important model class is that which arises from uncertainty about the underlying event 
tree. A similar model search algorithm to the one described in this paper is possible in this case 
after setting a prior distribution on the candidate event trees. 

In order to search any of these model classes more effectively, the problem of finding the MAP
model can be reformulated as a weighted MAX-SAT problem, for which algorithms have been
developed. This approach was used to great effect for finding a MAP BN by Cussens.

Appendix 

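Theorem 5 is based on three well-known results concerning properties of the Dirichlet distribution,
which we review below.

Lemma 14. Let $\gamma_j \sim \mathrm{Gamma}(\alpha_j, \beta)$, $j = 1, \ldots, n$, where $\alpha_j > 0$ for $j \in \{1, \ldots, n\}$, $\beta > 0$ and
$\perp\!\!\!\perp_{j} \gamma_j$. Furthermore, let $\theta_j = \gamma_j / \gamma$ for $j \in \{1, \ldots, n\}$, where $\gamma = \sum_{j=1}^{n} \gamma_j$.

Then $\theta = (\theta_j)_{j \in \{1, \ldots, n\}} \sim \mathrm{Dir}(\alpha_1, \ldots, \alpha_n)$.

Proof. Kotz et al. [17].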
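Lemma 15. Let $I[j] \subseteq \{1, \ldots, n\}$, $\gamma(I[j]) = \sum_{i \in I[j]} \gamma_i$ and $\theta(I[j]) = \sum_{i \in I[j]} \theta_i$.

Then for any partition $I = \{I[1], \ldots, I[k]\}$ of $\{1, \ldots, n\}$,

$$\theta(I) = (\theta(I[1]), \ldots, \theta(I[k])) \sim \mathrm{Dir}(\alpha(I[1]), \ldots, \alpha(I[k])),$$

where $\alpha(I[j]) = \sum_{i \in I[j]} \alpha_i$.

Proof. For any $I[j] \subseteq \{1, \ldots, n\}$, since $\perp\!\!\!\perp_{i \in I[j]} \gamma_i$, $\gamma(I[j]) \sim \mathrm{Gamma}(\alpha(I[j]), \beta)$ (a well-known result;
see, for example, Weatherburn), and for any partition $I = \{I[1], \ldots, I[k]\}$ of $\{1, \ldots, n\}$,
$\perp\!\!\!\perp_{i \in \{1, \ldots, k\}} \gamma(I[i])$. Therefore, as

$$\theta(I[j]) = \frac{\gamma(I[j])}{\gamma}$$

and $\gamma = \sum_{i=1}^{k} \gamma(I[i])$, the result follows from Lemma 14.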
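
The aggregation property in Lemma 15 can be checked numerically. The sketch below is only an illustration: it draws Gamma variables with a common scale, normalises them as in Lemma 14, sums the resulting probabilities within each block of a partition, and compares the empirical mean and variance with those of the predicted Dirichlet distribution. The parameter values, the partition and the function name are hypothetical.

```python
import random

def aggregated_dirichlet_sample(alphas, partition, rng):
    """Draw theta by normalising independent Gammas, then sum within blocks."""
    gammas = [rng.gammavariate(a, 1.0) for a in alphas]  # common scale beta = 1
    total = sum(gammas)
    return [sum(gammas[i] for i in block) / total for block in partition]

rng = random.Random(0)
alphas = [1.0, 2.0, 3.0, 4.0]      # hypothetical Dirichlet parameters
partition = [[0, 1], [2, 3]]       # two blocks of a partition (0-based indices)
draws = [aggregated_dirichlet_sample(alphas, partition, rng) for _ in range(20000)]

# The lemma predicts Dir(3, 7): the first component then has
# mean 3/10 and variance 3*7 / (10**2 * 11).
mean = sum(d[0] for d in draws) / len(draws)
var = sum((d[0] - mean) ** 2 for d in draws) / len(draws)
print(mean, var)
```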
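Lemma 16. For any $I[j] \subseteq \{1, \ldots, n\}$ where $|I[j]| \geq 2$,

$$\theta_{I[j]} = \left(\frac{\theta_i}{\theta(I[j])}\right)_{i \in I[j]} \sim \mathrm{Dir}\big((\alpha_i)_{i \in I[j]}\big).$$

Proof. Wilks.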



Theorem 17. Let the rates of units along the root-to-leaf paths $\lambda_i \in \Lambda$, $i \in \{1, \ldots, |\Lambda|\}$, of an event
tree $T$ have independent Gamma distributions with the same scale parameter, i.e. $\gamma_i = \gamma(\lambda_i) \sim
\mathrm{Gamma}(\alpha_i, \beta)$, $i \in \{1, \ldots, |\Lambda|\}$, and $\perp\!\!\!\perp_{i \in \{1, \ldots, |\Lambda|\}} \gamma_i$. Then the distribution on each floret in the tree
will be Dirichlet.

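Proof. Consider a floret $\mathcal{F}$ with root node $v$ and edge set $\{e_1, \ldots, e_l\}$. The rate for each edge
$e_i$, $\gamma(e_i)$, is equal to $\gamma(\Lambda_{e_i})$, where $\Lambda_{e_i}$ is the set of root-to-leaf paths that intersect with $e_i$, so that
$\gamma(e_i) \sim \mathrm{Gamma}(\alpha_{e_i}, \beta)$ and $\perp\!\!\!\perp_{i} \gamma(e_i)$.

Let $I = \{I[\mathcal{F}], I[\bar{\mathcal{F}}]\}$ partition $\Lambda$, where $I[\mathcal{F}] = \{\Lambda_{e_1}, \ldots, \Lambda_{e_l}\}$ and $I[\bar{\mathcal{F}}] = \Lambda \setminus I[\mathcal{F}]$. Then by
Lemma 16 the probability vector on $\mathcal{F}$ is Dirichlet, where

$$\theta_{I[\mathcal{F}]} \sim \mathrm{Dir}\big((\alpha_{e_j})_{j \in \{1, \ldots, l\}}\big).$$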
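
To make the role of the path parameters concrete, the sketch below computes the Dirichlet parameters of a floret by summing, for each edge, the Gamma shape parameters of the root-to-leaf paths through that edge. This relies on the additivity of independent Gamma variables with a common scale. The path labels, the tree shape and the function name are hypothetical.

```python
def floret_alphas(path_alphas, edge_paths):
    """Dirichlet parameters of a floret: for each edge, the sum of the shape
    parameters of the root-to-leaf paths that pass through that edge."""
    return [sum(path_alphas[p] for p in paths) for paths in edge_paths]

# Hypothetical floret with two edges: one root-to-leaf path passes through the
# first edge and three through the second, with a uniform prior on the paths.
path_alphas = {"l1": 1.0, "l2": 1.0, "l3": 1.0, "l4": 1.0}
edge_paths = [["l1"], ["l2", "l3", "l4"]]
print(floret_alphas(path_alphas, edge_paths))  # [1.0, 3.0]
```

With a uniform prior on the paths, this hypothetical floret receives a Beta(1,3) prior, the same form as the two-edge florets in Section 5.1.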